Part of Udacity Data Analyst Nanodegree

Introduction

For the EDA project, the data set is the Prosper Loan Data. This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information. The proper variable definations for the data can be found here.

Data

Let’s start this exploration with loading the data and saving the data into a dataframe to be inspected.

## [1] 113937

Considering that this data has 81 columns and 113937 rows, we can easily choose a chunk of this data to do analysis on. I shall start with dividing the data set into 2 considering the scenario pre-2009 and post-2009. This is also how the data is divided as columns such as the CreditGrade are only applicable for loans pre-July 2009. Thus, this data obviously tells a story of the pre-and-post 2009 scenarios and that is what I shall try to explore in this analysis.

Giving a little background on my decision to divide the data is important. From 2007 to mid 2009, the Great Financial Crisis occured followed by the Great Recession. The precipitating factor for the this Crisis was a high default rate in the United States subprime home mortgage sector i.e home loans. Critics argued that credit rating agencies and investors failed to accurately price the risk involved with mortgage-related financial products, and that governments did not adjust their regulatory practices to address 21st-century financial markets.

Moreover, accouding to Wikipedia, on November 24, 2008, the SEC found Prosper to be in violation of the Securities Act of 1933. As a result of these findings, the SEC imposed a cease and desist order on Prosper. In July 2009, Prosper reopened their website for lending (“investing”) and borrowing after having obtained SEC registration for its loans (“notes”). After the relaunch, bidding on loans was restricted to residents of 28 U.S. states and the District of Columbia. Borrowers may reside in any of 47 states, with residents of three states (Iowa, Maine, and North Dakota) not permitted to borrow through Prosper. This also highlights the divide in the data.

Thus, I will divide the data set into the pre and post financial crisis data set to get an accurate depection of the differences, specially, the credit grade and the numerous columns such as prosper score found only after 2009.

Let’s check the two dataframes

PRE JUNE 2009

## [1] 28973
##                       ListingNumber                         CreditGrade 
##                                   0                                  20 
##                                Term                          LoanStatus 
##                                   0                                   0 
##                          ClosedDate                         BorrowerAPR 
##                                   0                                  25 
##                        BorrowerRate                         LenderYield 
##                                   0                                   0 
##             EstimatedEffectiveYield                       EstimatedLoss 
##                               28957                               28957 
##                     EstimatedReturn             ProsperRating..numeric. 
##                               28957                               28957 
##               ProsperRating..Alpha.                        ProsperScore 
##                               28957                               28957 
##           ListingCategory..numeric.                       BorrowerState 
##                                   0                                5515 
##                          Occupation                    EmploymentStatus 
##                                2255                                2255 
##            EmploymentStatusDuration                 IsBorrowerHomeowner 
##                                7606                                   0 
##                    CurrentlyInGroup                            GroupKey 
##                                   0                               17678 
##               CreditScoreRangeLower               CreditScoreRangeUpper 
##                                 591                                 591 
##             FirstRecordedCreditLine                  CurrentCreditLines 
##                                 697                                7604 
##                     OpenCreditLines          TotalCreditLinespast7years 
##                                7604                                 697 
##               OpenRevolvingAccounts         OpenRevolvingMonthlyPayment 
##                                   0                                   0 
##                InquiriesLast6Months                      TotalInquiries 
##                                 697                                1159 
##                CurrentDelinquencies                    AmountDelinquent 
##                                 697                                7622 
##             DelinquenciesLast7Years            PublicRecordsLast10Years 
##                                 990                                 697 
##           PublicRecordsLast12Months              RevolvingCreditBalance 
##                                7604                                7604 
##                 BankcardUtilization             AvailableBankcardCredit 
##                                7604                                7544 
##                         TotalTrades  TradesNeverDelinquent..percentage. 
##                                7544                                7544 
##             TradesOpenedLast6Months                   DebtToIncomeRatio 
##                                7544                                1248 
##                         IncomeRange                    IncomeVerifiable 
##                                   0                                   0 
##                 StatedMonthlyIncome                   TotalProsperLoans 
##                                   0                               26730 
##          TotalProsperPaymentsBilled               OnTimeProsperPayments 
##                               26730                               26730 
## ProsperPaymentsLessThanOneMonthLate     ProsperPaymentsOneMonthPlusLate 
##                               26730                               26730 
##            ProsperPrincipalBorrowed         ProsperPrincipalOutstanding 
##                               26730                               26730 
##         ScorexChangeAtTimeOfListing           LoanCurrentDaysDelinquent 
##                               26731                                   0 
##       LoanFirstDefaultedCycleNumber          LoanMonthsSinceOrigination 
##                               18272                                   0 
##                          LoanNumber                  LoanOriginalAmount 
##                                   0                                   0 
##                 LoanOriginationDate              LoanOriginationQuarter 
##                                   0                                   0 
##                  MonthlyLoanPayment                 LP_CustomerPayments 
##                                   0                                   0 
##        LP_CustomerPrincipalPayments                  LP_InterestandFees 
##                                   0                                   0 
##                      LP_ServiceFees                   LP_CollectionFees 
##                                   0                                   0 
##               LP_GrossPrincipalLoss                 LP_NetPrincipalLoss 
##                                   0                                   0 
##     LP_NonPrincipalRecoverypayments                       PercentFunded 
##                                   0                                   0 
##                     Recommendations          InvestmentFromFriendsCount 
##                                   0                                   0 
##         InvestmentFromFriendsAmount                           Investors 
##                                   0                                   0 
##                                Date 
##                                   0

** POST 2009 DF **

## [1] 84964
##                       ListingNumber                         CreditGrade 
##                                   0                               84964 
##                                Term                          LoanStatus 
##                                   0                                   0 
##                          ClosedDate                         BorrowerAPR 
##                               58848                                   0 
##                        BorrowerRate                         LenderYield 
##                                   0                                   0 
##             EstimatedEffectiveYield                       EstimatedLoss 
##                                 127                                 127 
##                     EstimatedReturn             ProsperRating..numeric. 
##                                 127                                 127 
##               ProsperRating..Alpha.                        ProsperScore 
##                                 127                                 127 
##           ListingCategory..numeric.                       BorrowerState 
##                                   0                                   0 
##                          Occupation                    EmploymentStatus 
##                                1333                                   0 
##            EmploymentStatusDuration                 IsBorrowerHomeowner 
##                                  19                                   0 
##                    CurrentlyInGroup                            GroupKey 
##                                   0                               82918 
##               CreditScoreRangeLower               CreditScoreRangeUpper 
##                                   0                                   0 
##             FirstRecordedCreditLine                  CurrentCreditLines 
##                                   0                                   0 
##                     OpenCreditLines          TotalCreditLinespast7years 
##                                   0                                   0 
##               OpenRevolvingAccounts         OpenRevolvingMonthlyPayment 
##                                   0                                   0 
##                InquiriesLast6Months                      TotalInquiries 
##                                   0                                   0 
##                CurrentDelinquencies                    AmountDelinquent 
##                                   0                                   0 
##             DelinquenciesLast7Years            PublicRecordsLast10Years 
##                                   0                                   0 
##           PublicRecordsLast12Months              RevolvingCreditBalance 
##                                   0                                   0 
##                 BankcardUtilization             AvailableBankcardCredit 
##                                   0                                   0 
##                         TotalTrades  TradesNeverDelinquent..percentage. 
##                                   0                                   0 
##             TradesOpenedLast6Months                   DebtToIncomeRatio 
##                                   0                                7306 
##                         IncomeRange                    IncomeVerifiable 
##                                   0                                   0 
##                 StatedMonthlyIncome                   TotalProsperLoans 
##                                   0                               65122 
##          TotalProsperPaymentsBilled               OnTimeProsperPayments 
##                               65122                               65122 
## ProsperPaymentsLessThanOneMonthLate     ProsperPaymentsOneMonthPlusLate 
##                               65122                               65122 
##            ProsperPrincipalBorrowed         ProsperPrincipalOutstanding 
##                               65122                               65122 
##         ScorexChangeAtTimeOfListing           LoanCurrentDaysDelinquent 
##                               68278                                   0 
##       LoanFirstDefaultedCycleNumber          LoanMonthsSinceOrigination 
##                               78713                                   0 
##                          LoanNumber                  LoanOriginalAmount 
##                                   0                                   0 
##                 LoanOriginationDate              LoanOriginationQuarter 
##                                   0                                   0 
##                  MonthlyLoanPayment                 LP_CustomerPayments 
##                                   0                                   0 
##        LP_CustomerPrincipalPayments                  LP_InterestandFees 
##                                   0                                   0 
##                      LP_ServiceFees                   LP_CollectionFees 
##                                   0                                   0 
##               LP_GrossPrincipalLoss                 LP_NetPrincipalLoss 
##                                   0                                   0 
##     LP_NonPrincipalRecoverypayments                       PercentFunded 
##                                   0                                   0 
##                     Recommendations          InvestmentFromFriendsCount 
##                                   0                                   0 
##         InvestmentFromFriendsAmount                           Investors 
##                                   0                                   0 
##                                Date 
##                                   0

So there is an obvious difference in the loan data pre-2009 and post-2009. For starters, the number of rows in each set is quite different. I shall have to make use of ratios and percentages to do my analysis.

Analysis

Categories of Loan

I want to see if there was a difference pre or post 2009 for which the loans were taken. Considering this is a discrete variable a bar chart would be a good choice for this type of analysis.

## 
##     0     1     2     3     4     5     6     7 
## 16945  5071   623  1874  2395   476   329  1260
## 
##     0     1     2     3     5     6     7     8     9    10    11    12 
##    20 53237  6810  5315   280  2243  9234   199    85    91   217    59 
##    13    14    15    16    17    18    19    20 
##  1996   876  1522   304    52   885   768   771

Oddly enough, individuals are taking loas for various different reasons after the crisis. This seems to be inline with the article here which highlights that since the recession is over and people have their jobs back, there seems to a trust in the industry again. So we see quite a few differences between the pre-2009 loan categories and the post-2009 loan categories.

  • Post Crisis, loans are taken for quite a few different reasons including luxury items such as buying a boat or an RV. Maybe prices are up and people could afford luxury items before
  • Pre crisis, we can see that, while we do not have the loan category available for almost 60% of the loans, there seems to still be a trend present i.e. that there is no luxury spendings. Loans were only taken out when there is an actual need and not as a luxury.
  • Debt consolidation seems to be the major reason for taking out a loan pre crisis and post crisis. Why do people take out so much loan for debt consolidation?

Debt consolidation is where someone obtains a new loan to pay out a number of smaller loans, debts, or bills that they are currently making payments on. In doing this they effectively bring all these debts together into one combined loan with one monthly payment. Since this is bringing multiple debts together and combining them into one loan, this is referred to as “consolidating” them. That’s why it’s called a debt consolidation loan.

Let’s look at some of the demographics of the people who have taken out debt consolidation loans.

What is the occupation of poeple who take out the debt consolidation loans?

There is a very little change in the trend of occupations of borrowers before and after the crisis. Professionals ask for the maximum percentage of loans. A thing to note however, of the top 10, 6 of the professions are high earning professions. Now lets look at it further and see that from the whole batch, how many have taken out debt consolidation loans?

This is where we see the change in the trends of Loans before and after the crisis. * Before the crisis, a very small percentage of the loans were debt consolidation loans, with professionals have about 5% of all loans be debt consolidation loans. * After the the crisis, we see a majority of the loans are debt consolidation loans in every occupation. Taking the professional category only, there is an increase from 5% to 21% showing how most of the loans after crisis were debt consolidation loans. From the categories graph we see that almost 60% of the total loans post crisis were Debt Consolidation loans.

Which State Has More Loans?

Heat Map plotting has been leanrt from here

## [1] "colorado"   "georgia"    "minnesota"  "new mexico" "kansas"    
## [6] "california"

The highest mean amount borrowed is by one of the smallest states i.e New Jersy while the rest of the states seem to have similar mean amounts borrowed

By the number of loans asked for, the California seems to have the highest number. New Jersey, with its high mean amount seems to be down the ladder with its low number of loan counts.

We see that many states have higher mean loan amounts as compared to before and even each other because there are many more lighter blues than before.

However, the number of loans by state seem to follow a similar patter as before with California leading the way in the number of loans after the crisis even though the mean loan amount was less.

Income Range Vs Borrower Rate

Common legend code from here

Income range was given in a range formats, hence I chose the stated monthly income and multiplied it by 12 to get yearly income, which is much easier to play aroud with. I compared the yearly income with the borrower rate to see if there’s any connection between the two. I see that it is all over the place i.e there is no connection present, so I added the IsBorrowerHomeOwner field to check if there’s any relation. I see that yearly income is a major factor in determining if the borrower is a homeowner or not. For both before and after the crisis, an income of less than 50000 per annum gives a very low probability of having a home. Borrower Rate vs income is all over the place though. Before the crisis, the borrower rate went to 0.5 but after the crisis the borrower rate is capped at 0.3 no matter what the income.

Income Range Vs Loan Amount Asked For

Since the credit Grade and the Prosper Rating are the same, only categorized before and after the crisis, I decided to see how the ratings are affected by a combination of loan amount vs income. While the scatter plot is scttered, there is a marked difference between the pre and post crisis plots. * Higher Loan amounts to for the same income range after the crisis which can indicate that people are asking for more loans than before. * A marked change in Prosper ratings/ Credit Ratings on the $4000, $5000, $10000 and $20000 marks on the Y axis after the crisis while before the crisis, there isn’t a marked change seen. Infact, almost all of the blue and purple points are below the $5000 mark which can highlight a more cautious way of giving a rating to the borrower.

Estimated Loss and Estiated Return based on Debt to Income Ration showing ProsperRating

Now we can clearly see what the Prosper rating after the crisis was dependant on i.e Estimated Loss. The more the loss that was estimated, the worse your Prosper Rating would be. Plots 1 and 3 are important in showing this relationship, where the color changes with the change of the Estimated Loss. This highlights that the rating was based on the Estimated Loss of that could be incurred from the borrower.

I also checked the relationship with the Estimated Return column, but it is clearly seen that the there is no connection to the Rating with the return with the coloured dots all over the graph.

Another highly interesting graph to see is the plot of EstimatedEffectiveYeild vs the Estimated Return Plot. Once again we see that there is a proper change in the colour i.e the Prosper Rating showing how the Rating is affected by the Yeild. An interesting observataion is that there is a solid straight line through the graph. This is when the Yeild equals to the return and when the Loss equals 0. The colors depend then on only the Effective Yeild and not the loss.

How many borrowers are also home owners?

## [1] "50.4471769486646 % of borrowers are also homeowners"

How many home owners have taken a debt consolidation loan?

## [1] "27.4142727998806 percent of the homeowners have also taken out debt consolidations loans"

What percentage of the borrowers are Debt Consolidation ?

## [1] "51.1756497011506 percent of the borrowers have taken out debt consolidations loans"

I did not think there was a need to make a visual to answer this question because of its obvious nature. I am looking at numbers and thus, I can see that loans for homeowners are there and many of the homeowners have more than one loan, thus there is a need to take out debt consolidation loans.

Let’s look at the actual loans and what they are affected by:

What are the summary statistics for the loan amounts approved?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    2500    4500    6166    8000   25000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    7500    9077   13500   35000

Before the crisis, there loan amounts taken are less than after the crisis. While the base loan amount is the same for both, we see a higher range after the crisis, with a higher mean and median.

The simplist plot i.e the boxplot also highlights this aspect with the post crisis mean and median being higher than the pre- crisis mode and median.

What is the monthly income of the borrowers?

It seems that with the higher loan amounts of the borrowers, their income range has also increased after the crisis with the outliers so high, that I could not see the actual boxplot. It is after plotting the boxplot without the outliers do I see the trend that the income after the crisis is higher. This can be associated with the higer cost of living after the crisis and with the increase of the basic income needed to live along with inflation. All in all $1 in 2008 → $1.17 in 2018 showing how the cost of living has increased with the increase in income. This can also give reason to the higher loan prices.

On Time vs Less than one month late vs more than one month late payments

Apart from the lesser number of loans taken before the crisis, we see that a very small percentage of the loan payments were less than a month late and almost none of them were more than one month late before the crisis. After the crisis though, we see that a significant percentage of loan payments were late and many even more than one month late.

Time Vs Loans

For this duration, the median is less than the mean. An interesting thing to note is that there is a massive drop from the last 2008 quater to the mid 2009. These are the years of the Great Financial Crisis and this graph proves that the Loan trends were affected significantly by the Great Crisis of 2008.

Final Plots and Summary

The aim for the final plots is to combine a 2-4 plots into one to form a summary of the trends I have tried to highlight during the analysis.

Comparision Of Debt Consolidation Loans Before and After The Crisis.

Knowing that the debt consolidation loans make about 51.17% of the total borrowers, it is important to know what the demographics of these borrowers are. Their occupations is what I have plotted. It is also important to segament the data into before and after the crisis to highlight the drastic change in loan taking. Where debt consolidation was a minor part of the total loan categories before the Crisis, it became the number 1 reason for taking loans after the Crisis.

As seen from the graph above, where the light blue color highlights are percentage of debt consolidation loans while the dark blue color highlights the other reasons for taking loans, it can be seen wht debt consolidation makes up about 51% of the data. Prior to the crisis, there was no reason to take debt consoldation loans due to people’s income level being at par with their loan debt. However, as mentioned before, a debt consolidation loan combines all other loans into one big loan with low interest so that an individual only has to pay one monthly payment. After the crisis it can be seen that the loan taking has completely switched, with the majority being debt consolidation loans and only a minority being loans for other purposes. It should be mentioned as I know from plotting the graph before, that after the crisis people started taking loans for luxury items as well such as a boat, motor cycle and vacations. In light of this information, it is still astonishing to see the proportion of debt consolidation loans over other loans.

Coming to the occupations of the people who are asking for the loans, it can be seen that loans are mostly given to people with an already steady and high income such as Analyst, or computer programmer or Executive. These people have the highest earning jobs and thus can proof their income and ability to pay back their loans. However, it is the professional category that has the maximum number of loans in pre-crisis and post crisis categories.

Factors Affecting Prosper Rating (Applicable after July 2009)

Given the numerous facors given for the Prosper rating, I wanted to know which is the most important and what does the Prosper rating depend on. The data is majority after the crisis data i.e after July 2009 so I had enough data to make a conclusion.

From the given plot, which is an amalgamation of four plots which plots four different variables against each other, I can easily see that Propser Rating depends solely on Estimated Loss. The greater your estimated loss ratio, the worse your rating is given to be. Estimated loss is then calculated from a variety of different elements such as demographics (occupation, employement status, investors) and income (monthly income, other loans, number of delinquencies and amount delinquent)

I can also safely say that Prosper rating is dependant on estimated loss due to the proper lines formed when plotted with Estimated Loss while there is no lines seen with the plots of Estimated Return. This is also highlighted with the 4th graph of Estimted return vs Estimated Effective Yeild. It is obvious the two variables are correlated. An increase in Estimated Return should also give you an increase in Estimated Effective Yeild. However, we see that the two are not lineraly related. There’s a squared correlation. This is because of the Estimated Loss variablt which highlights how the Estimated Loss variable is vital in calculating the Prosper Rating.

Comparisions of State Incomes with State Loans

I wanted to check how state incomes are mapped with state loans and thus, I plotted the mean of the two giving me an indication of how the states handle debt.

This chart shows the recent debt per state in the US where California, Texas and NewYork have massive amounts of Debt.

The data given, only has Prosper oans till 2014 and thus, it is imporant to see the current world scenario. As mentioned before, the incomes and loan amounts have increased 10 fold after the crisis. In the chart, we see that the states with the highest average incomes are paying the least amounts of debt, while states such as California is which have low averge income are paying a high amount for their debt. This can show the start of the Dbet crisis that the world is not facing with more defaulters than on time paying individuals.

Conclusion and Reflection

I was interested in analyzing this particular set of data because of my interest in world affairs and the debt crisis. The Debt Crisis was particularly interesting because it has its causes stuck in the 1980s. Bad governemtn policies and irresponisble debt taking trends along with a lack of awareness seems to be the key reasons for this debt crisis. In 2008, U.S. households lost an estimated 18% of their net worth, equaling approximately an 11.2 trillion loss. In 2011, the U.S. debt reached 100% of its GDP and now the fedral goverments total debt stands at an enourmous 21.97 trillion. It’s all connected.

Thus, I undertook the effort to try and figure out the consequences of the Financial Crisis of 2008-09. I found it was tougher than expected, because I was unfamiliar with the syntax, the function calls, and the overall behavior of R. Maybe if I had done this in Jupyter Notebooks, it would be better as I am fimilair with it.

I was pleased with how easy it is to produce a simple map of the world in R using the states library. And plotting scatter plots and lines was easy enough and straight forward. But I spent a lot of time looking for advanced plotting packages that should make my life easier. After wasting a lot of time, I realized I felt compelled to code it myself and get the job done correctly. These are the problems of learning a new language so quickly.

I got hung up multiple times my code started behaving oddly. For example when I was plotting a the world map and the figure was not darwing and only the background was showing. I had to ask my fellow Bertlesmann Scholars for help and some of them were just as stuck on it as I was. On the other hand, it was nice to be able run code in a function, and access the results from the console.

At the end of it all, R programming is not a whole lot easier than Java or C (which I have done a lot of coding in), except for really simple data structures. If asked for a preference, I’d chose Python any day. However, given enough time with R as I have had with Python, it would probably be hard to decide.

I feel like I would have been able to do this project better in Python as I am more fimiliar with the syntax having being using the language as my main for the past couple of years and with the looming deadline, there was only a finite amount of time I could have spent on this project and trying to learn R.